Team Illinois Members


Introduction

The purpose of this project is to attempt to build models which can be used to accurately predict premature death based on various health and socio-economic factors. The data for this project were obtained from https://www.countyhealthrankings.org/. County Health Rankings ranks each county within each state based on a variety of health measures obtained from national and state data sources. One of the many attributes contained in the raw data is the “Years of Potential Life Lost Rate” (YPLLRate) which represents premature death for the a given geographical region. The YPLLRate value is the number of years lost per 100,000 people based on the various health and socio-economic factors present in the data. This project seeks to model the interaction between the YPLLRate serving as the response variable and the other attributes serving as the predictor variables.

The raw data for this project were obtained from https://www.countyhealthrankings.org/. County Health Rankings obtained this data from various nation-wide and state data sources. The raw data consisted of over 240 variables and the preliminary cleaning reduced this number to 96 by removing redundancies and sparsely populated variables.

The redundant variables included many examples of providing a quantity and a percentage of some factor. For example, there were variables for the number of smokers in a region as well as the percentage of the population that smoked. The percentage variables were chosen over the quantity variables and the quantity variable were omitted There were also many race-specific variables that had N/A values in too many observations to be useful as predictors. These variables were omitted as well.

The remaining predictor variables are made up of categorical geographic data such as state and county, health related numeric variables such as percentage of low birth weight, percentage of smokers, and percentage of adults with diabetes, and socio-economic numeric variables such as percentage of single parent households, percentage of people with some college, and percentage of people unemployed.

The objective of this project is to identify which combination of the remaining predictors most-contributes to premature death.

Methods

The large number of predictors in the cleaned data set made model selection very difficult. No single approach seemed to emerge as the best route so we opted to pursue several approaches and compare the results. Model-selection algorithm proved to be computationally infeasible in that it ran for exceptionally long times and the inclusion of the State and Region categorical variables made the number of permutations far too high to be useful. Our three approaches are similar in some ways and different in others as detailed below.

Test Statistics used:

  • adjusted R-squared
  • LOOCV RMSE
  • Standard Deviation
  • Correlation Coefficients
  • Partial Correlation Coefficients
  • Shapiro-Wilk Test
  • Breusch-Pagan-Godfrey Test

Approach 1

This approach involved attempting to manually select a smaller set of predictors based on their textual descriptions and their individual linear relationships to the response variable.

data_1 = read_csv("data/model_1_data.csv")
pairs(data_1)

Below are plots of the relationships between the response and several predictors.

Transforming predictor variables had no effect on model improvement, but performing a log transformation on the YPLLRate response variable did show improvement. The model was selected by a forward AIC selection with the scope being set to the full interactive model of the chosen predictors.

empty_model = lm(log(YPLLRate) ~ 1, data = data_1)

selected_model = step(empty_model, scope = log(YPLLRate) ~ HealthBadDaysMentAvg * HospPrevRate * 
                                                           Income20thpcntl * InjuryDeathRate * 
                                                           LBWPct * LifeExpect * 
                                                           MammAnnualPct * SmokersPct *
                                                           UnemployedPct * UninsAdultPct * 
                                                           VaccinatedPct, 
                                                           direction = "forward", trace = 0)

influential_points = which(cooks.distance(selected_model) > 4 / length(cooks.distance(selected_model)))
outliers = as.vector(as.integer(names(rstandard(selected_model)[abs(rstandard(selected_model)) > 2])))
noise = c(outliers, influential_points)

new_data_1 = data_1[-noise, ]
new_empty_model = lm(log(YPLLRate) ~ 1, data = new_data_1)

final_model_1 = step(new_empty_model, scope = log(YPLLRate) ~ HealthBadDaysMentAvg * HospPrevRate * 
                                                              Income20thpcntl * InjuryDeathRate * 
                                                              LBWPct * LifeExpect * 
                                                              MammAnnualPct * SmokersPct * 
                                                              UnemployedPct *
                                                              UninsAdultPct *  VaccinatedPct, 
                                                              direction = "forward", trace = 0)

Approach 2

In this method, we started by selecting a small number of predictors that showed a strong connection to the response variable “Years of Potential Life Lost Rate.” This selection was done based on the study of the variables from the data source https://www.countyhealthrankings.org/explore-health-rankings/rankings-data-documentation

This set of variables were added to smaller file to make the loading and knitting of the markdown document more efficient.

model_2_DF = read_csv("data//model_2_data.csv") 

model_2_DF$`WaterViol` = as.factor(model_2_DF$`WaterViol`)

model_2_DF = model_2_DF[sample(1:nrow(model_2_DF), 2000, replace=FALSE),]

# Split into test and train
indexes = sample(nrow(model_2_DF), 1000)
train_df = model_2_DF[indexes,]
test_df = model_2_DF[-indexes,]
  • Through a pairs plot of the variables we can see that some variables appear to have collinearity issues.
  • There are others that may benefit from transformations. We will explore this further.

  • Various transformations were applied to the predictor variables and only one of them showed a marked difference from the transformation. But this variable was dropped later in the final model derived in this section.

  • We started by fitting a model with all predictors. Fitted vs Residuals plot of this model shows outliers that could be influencing the model. In the next step we removed the ouliers and can see from the Fitted vs Residuals and Q-Qplot that it greatly improved the variance distribution.
# FULL model, with all predictors
model_2_full = lm(YPLLRate ~ . , data = model_2_DF)
# Check for Outliers and Influence.
outliers = as.vector(as.integer(names(rstandard(model_2_full)[abs(rstandard(model_2_full)) > 2])))
high_influence = as.vector(which(cooks.distance(model_2_full) > 4 / length(cooks.distance(model_2_full))))

# Remove outliers
remove_noise = c(outliers, high_influence)
model_2_DF_clean = model_2_DF[-remove_noise,]

model_2_full_cl = lm(YPLLRate ~ . , data = model_2_DF_clean)
(model_2_aic = step(model_2_full_cl, direction = "backward", trace=0))
## 
## Call:
## lm(formula = YPLLRate ~ LBWPct + ObeseAdultsPct + DrinkExcPct + 
##     HSGradRate + DistressFreqPhysPct + HIVRate + IncHousehldMedian + 
##     HomeownersPct, data = model_2_DF_clean)
## 
## Coefficients:
##         (Intercept)               LBWPct       ObeseAdultsPct  
##           2115.2234             292.2075              38.8823  
##         DrinkExcPct           HSGradRate  DistressFreqPhysPct  
##           -118.1679              14.6575             302.0279  
##             HIVRate    IncHousehldMedian        HomeownersPct  
##              0.3586              -0.0461              29.5451

We tried several approaches to improve this model but the best model we could achieve was by applying box-cox transform to the response variable of a model selected via backward-AIC. model

boxcox(model_2_aic, lambda = seq(-0.25, 0.75, by = 0.05), plotit = TRUE)

model_2_cox = lm(formula = (((YPLLRate ^ 0.6) - 1) / 0.6) ~ LBWPct + ObeseAdultsPct + DrinkExcPct +  HSGradRate + DistressFreqPhysPct + HIVRate + IncHousehldMedian +  HomeownersPct, data = model_2_DF_clean)

Plots

Approach 3

Data Cleaning:

We started with a script that merged two sheets from the original data. The team chose to keep variables that contained rates and to drop race demographics. We started with 92 variables and 3141 observations. For this approach, we then dropped columns with too many NA’s (more than our result YPLLRate) and then omitted observations containing ‘NA’. This left us with 71 variables and 2333 observations. Lastly, several variables were coerced to factors and short names were added.

Assessment of Model 3:

Model 3 has a good adjusted R-squared value of 0.972. LOOCV RMSE is 17.5 and when normalized with the standard deviation the normalized RMSE is 0.16. This is a pretty good value. When the data was split into test and train and run 1000 times, the normalized RMSE was 0.17, confirming our LOOCV RMSE result. The transformation with BoxCox lambda greatly improved the Q-Q plot and the tails were minimal. The Shapiro test had a high P-value which support the null hypothesis of normality of errors. The remaining problem is it fails the BPtest with BP value of 131 and p-value very low. The model still suffers from heteroscedasticity but does well on the other measures. BP value was much improved from over 300 before the BoxCox transform and removal of influential observations. Model 3 uses 23 predictors out of 93 possible and they are representative of the major contributors to health (environment, income, healthcare, demographics). Given the high correlation between many of the predictors and the high variance of the response, this is a decent model.

Residuals Plot

Q-Q Plot

Results

The three models are shown below:

Model_1

formula(final_model_1)
## log(YPLLRate) ~ LifeExpect + InjuryDeathRate + LBWPct + UninsAdultPct + 
##     UnemployedPct + Income20thpcntl + MammAnnualPct + LifeExpect:InjuryDeathRate + 
##     LifeExpect:LBWPct + LBWPct:Income20thpcntl + LifeExpect:UninsAdultPct + 
##     InjuryDeathRate:UninsAdultPct + UninsAdultPct:MammAnnualPct + 
##     UninsAdultPct:UnemployedPct + LifeExpect:UnemployedPct + 
##     InjuryDeathRate:UnemployedPct + LifeExpect:Income20thpcntl + 
##     InjuryDeathRate:Income20thpcntl + InjuryDeathRate:MammAnnualPct + 
##     LBWPct:MammAnnualPct + InjuryDeathRate:LBWPct + LifeExpect:UninsAdultPct:UnemployedPct + 
##     LifeExpect:InjuryDeathRate:UninsAdultPct + LifeExpect:LBWPct:Income20thpcntl

Model_2

formula(model_2_cox)
## (((YPLLRate^0.6) - 1)/0.6) ~ LBWPct + ObeseAdultsPct + DrinkExcPct + 
##     HSGradRate + DistressFreqPhysPct + HIVRate + IncHousehldMedian + 
##     HomeownersPct

Model_3

formula(model_3)
## YPLLRate^lmda_mdl3 ~ HealthPoorPct + FoodEnvirIX + PhysInactPct + 
##     PhysPrimCareRate + MammAnnualPct + Income20thpcntl + IncomeRatio + 
##     SocsAssRate + InjuryDeathRate + HousSvrProbPct + HousInadFacil + 
##     LifeExpect + DeathAgeAdjRate + VaccinatedPct + IncHousehldMedian + 
##     PopGE65Pct + PopNativePct + PopPacificPct + EnglishNotProfPct + 
##     PopRuralPct + PopGE65Pct:DeathAgeAdjRate + HealthPoorPct:PopFemalePct + 
##     LBWPct:IncHousehldMedian

Model Diagnostics

Model 1 Model 2 Model 3
BP Test Statistic Value 95.4644 34.6349 131.3602
Shapiro Wilk p-value 0.0746 0.0127 0.3390
No params 25.0000 9.0000 24.0000
LOOV RMSE 0.0535 27.1860 17.5007
Adj R2 0.9618 0.7934 0.9717
RMSE 0.0529 27.0614 17.3025
AIC -12514.0682 12273.7668 12397.8798
Normalized RMSE 0.1835 0.4240 0.1580

The table above summarizes the metrics obtained from each of the three models. Models 1 and 3 appear to have similar metrics Each of the models failed the Breusch-Pagan test indicating that the constant variance assumption has been violated. All three p-values were essentially 0 so the table above lists the BP test statistic instead. All 3 model passed the Shapiro-Wilk test indicating that the data were sampled from a normal distribution.

Discussion

cdnaomit = cooks.distance(mdl3t)
influential = which(cdnaomit > 4/length(cdnaomit))
Raw95nooutlier = Raw95naomit[-influential,]

set.seed(42)
num_sim = 1000

rmse_final = c(rep(num_sim, 0))
ratio_final = c(rep(num_sim, 0))
pcterr_final = c(rep(num_sim, 0))

for (i in 1:num_sim) {
  indexes = sample(nrow(Raw95nooutlier), (nrow(Raw95nooutlier)) /2)
  Raw95Train = Raw95nooutlier[indexes,]
  Raw95Test = Raw95nooutlier[-indexes,]
  
  #...dont need subset because already got rid of influential points
  #final: Calculate standard deviations for train, and for YPLLRate and YPLLRate^lmda_mdljd
  sdtrboxcox = sd(Raw95Train$YPLLRate^lmda_mdl3)
  
  #plain: This model has the response BoxCox transform
  mdljdfinal_trn = lm(formula = formula(model_3), data = Raw95Train)
  
  rmse_final[i]   = get_loocv_rmse(mdljdfinal_trn)
  ratio_final[i]  = get_loocv_rmse(mdljdfinal_trn)/sdtrboxcox
  
  #Prediction with Test Values
  pcterr_final[i] = get_perc_err(Raw95Test$YPLLRate^lmda_mdl3, predict(mdljdfinal_trn, Raw95Test))
}

median(pcterr_final)
## [1] 2.629
mean(pcterr_final)
## [1] 2.629
median(ratio_final)
## [1] 0.1703
mean(ratio_final)
## [1] 0.1703

We used Model 3 for testing purposes since it had the highest Adjusted R Squared value and the lowest Normalized RMSE value. The data were split in half for a test set and a train set. The A model was fit using the train data set and this model was used to predict the test data. This was simulated 1000 times and the percent error and normalized RMSE was saved each time.

A histogram of the normalized RMSE and percent error was plotted for each of the 1000 trained models. The mean of the normalized RMSE for the models was 0.17 which is very close to the LOOCV RMSE value of 0.158 for Model 3.

The mean of the percent error was 2.63 chichis less than and more narrowly distributed than for the model whose predictor was not transformed.

It appears that, based on the information above that Model 3 could be useful for predicting the value of Years of Potential Life Lost Rate.